feat(index): add streaming ivf kmeans training #6913
Open
BubbleCal wants to merge 11 commits into
Xuanwo
approved these changes
May 26, 2026
Summary
Adds LanceStream, a streaming IVF kmeans training path that keeps peak raw-vector training memory bounded for very large IVF partition counts while preserving the existing non-streaming trainer when streaming is not requested.
The latest update also bounds the sampling read buffer: fixed sampled ranges are split into smaller range reads, and `take_scan` readahead defaults to one batch. This avoids the previous RSS spike from large contiguous sampled ranges while keeping an environment override for benchmark tuning.

Feature
- […] `streaming_sample_rate`, `streaming_coreset_rate`, and `streaming_refine_passes` to IVF build parameters.
- […] `num_partitions > 256` on hierarchical kmeans; large-k streaming does not fall back to flat kmeans.
- […] `take_scan`; default range chunk is 8192 rows.
- […] `LANCE_STREAMING_IVF_PREFETCH_DEPTH` and `LANCE_STREAMING_IVF_TAKE_RANGE_ROWS` for experimental tuning. Default prefetch depth is 1 because deeper prefetch gave minimal speedup and higher RSS.

Algorithm
The existing IVF trainer samples up to `num_partitions * sample_rate` raw vectors before training. For very large `num_partitions`, that materializes too many raw vectors at once. LanceStream decouples total sample size from peak raw-vector memory.

The streaming path works as follows:
1. […] `num_partitions * sample_rate` and dataset row count.
2. […] `num_partitions * streaming_sample_rate` raw vectors.
3. […] `num_partitions * streaming_coreset_rate`.
4. […] `num_partitions`.
5. […] `streaming_refine_passes` raw-vector Lloyd passes over the same fixed sample. Each pass streams chunks, assigns raw vectors to current centroids, accumulates global sums/counts/loss, updates non-empty centroids at pass end, and preserves old centroids for empty clusters.

Peak raw-vector memory is bounded by:
The coreset budget is separately bounded by:
Centroid and accumulator buffers still scale with `num_partitions * dimension`.

Benchmarks
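As a rough guide to the memory side of these benchmarks, the bounds described under Algorithm can be turned into byte counts. The sketch below is a back-of-the-envelope estimate, not code from this PR: `raw_vector_bytes` and `to_gib` are hypothetical helpers, and the calculation assumes `float32` vectors, one resident chunk of `num_partitions * streaming_sample_rate` rows, and a coreset of `num_partitions * streaming_coreset_rate` rows with no per-row overhead.

```rust
/// Bytes needed to hold `num_partitions * rows_per_partition` float32 vectors.
/// Hypothetical estimation helper; not part of the Lance API.
fn raw_vector_bytes(num_partitions: u64, rows_per_partition: u64, dimension: u64) -> u64 {
    const F32_BYTES: u64 = 4; // assumption: vectors are stored as float32
    num_partitions * rows_per_partition * dimension * F32_BYTES
}

/// Convert a byte count to GiB for display.
fn to_gib(bytes: u64) -> f64 {
    bytes as f64 / (1u64 << 30) as f64
}
```

Under these assumptions, at `k = 131,072` and dimension 128, `to_gib(raw_vector_bytes(131_072, 256, 128))` gives 16 GiB for a full `sample_rate=256` sample, `stream=64` gives 4 GiB per chunk, `coreset=16` gives 1 GiB, and passing `1` as the rows-per-partition argument gives 64 MiB for a single centroid-sized buffer. Measured RSS will be higher than these raw-vector figures because of accumulators, read buffers, and allocator overhead.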
Setup
- […] `c2-standard-16`, 16 vCPU, ~64 GiB RAM, AVX512, 500 GB disk.
- […] `float32`.
- […] `k=1024` on SIFT1M subset; `k>=4096` on SIFT1B prefix.
- […] `k * 256` rows for unified exact-loss evaluation; training sample size also `k * 256`.
- […] `stream=64`, `coreset=16`, `prefetch=1`, `refine=0`, `sample_rate=256`, `max_iters=50`.
- […] `stream`, `coreset`, and `prefetch_depth`.

Loss Methodology
All comparable losses below are computed outside the training algorithm:
1. […] `k * 256` vectors from the dataset.
2. […] `faiss.IndexFlatL2`:

No table below uses algorithm-internal reported loss for comparison. Exact loss is reported for `k <= 16,384`. For `k=65,536` and `131,072`, exact loss was not run because one evaluation is roughly `O(k * k * 256)` distance work; those rows report train time/RSS/status only.

LanceStream Tuning
Percentages are relative to LanceStream `stream=64, coreset=16, prefetch=1` at the same `k`.

Takeaways:
- […] `prefetch_depth` is not a good default. At 128K, p16 was only 1.6% faster than p1 but used 6.7% more RSS. At smaller k it often increased RSS much more.
- […] `coreset` from 16 to 8 is the strongest speed/RSS knob, but it can raise loss. At 16K it was 38.7% faster and 12.4% lower RSS, but loss was 6.0% worse.
- […] `stream` lowers RSS but does not necessarily improve time. At 128K, `stream=16, coreset=8` used only 2.39 GiB but took 17.0 minutes.

Algorithm Comparison
Percentages are relative to LanceStream `stream=64, coreset=16, prefetch=1` at the same `k`.

Large-k notes:
- […] `k * 256` raw sample; at 128K it used 32.62 GiB RSS on this 128D dataset.
- […] `stream=16, coreset=8` variant used 2.39 GiB.

Validation
- `cargo fmt --all --check`
- `cargo check -p lance --locked`
- `cargo test -p lance --lib test_split_ranges_by_row_count --locked` on a clean temp worktree containing only the production `ivf.rs` change
- `git diff --check`
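The range splitting exercised by `test_split_ranges_by_row_count` can be illustrated with a small standalone sketch. The function below is a guess at the shape of such a helper, not the code in `ivf.rs`: its name, signature, and use of `Range<u64>` are assumptions made for illustration. The 8192-row value in the usage note is the default range chunk stated in the Feature list.

```rust
use std::ops::Range;

/// Split each row range into consecutive sub-ranges of at most `max_rows` rows.
/// Hypothetical stand-in for illustration; the production helper may differ.
fn split_ranges_by_row_count(ranges: &[Range<u64>], max_rows: u64) -> Vec<Range<u64>> {
    assert!(max_rows > 0, "max_rows must be positive");
    let mut out = Vec::new();
    for range in ranges {
        let mut start = range.start;
        // Empty ranges (start >= end) produce no output.
        while start < range.end {
            // Clamp the chunk end to the range end; saturating_add avoids overflow.
            let end = range.end.min(start.saturating_add(max_rows));
            out.push(start..end);
            start = end;
        }
    }
    out
}
```

For example, `split_ranges_by_row_count(&[0..20_000], 8192)` yields `0..8192`, `8192..16384`, and `16384..20000`, so no single read covers more than 8192 rows. Keeping each read small is what bounds the sampling read buffer; the trade-off is more, smaller read requests.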